Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification
نویسندگان
چکیده
Extracting and processing information from web pages is an important task in many areas like constructing search engines, information retrieval, and data mining from the Web. Common approach in the extraction process is to represent a page as a “bag of words” and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using visual information one is able to define heuristics for recognition of common page areas such as header, left and right menu, footer and center of a page. We show in initial experiments that using our heuristics defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier, taking into account the proposed representation, clearly outperforms the same classifier using only information about the content of documents.
منابع مشابه
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
متن کاملA Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
متن کاملSearching for web information more efficiently using presentational layout analysis
Extracting and processing information from web pages is an important task in many areas like constructing search engines, information retrieval, and data mining from the web. A common approach in the extraction process is to represent a page as a bag of words and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation th...
متن کاملUrban Vegetation Recognition Based on the Decision Level Fusion of Hyperspectral and Lidar Data
Introduction: Information about vegetation cover and their health has always been interesting to ecologists due to its importance in terms of habitat, energy production and other important characteristics of plants on the earth planet. Nowadays, developments in remote sensing technologies caused more remotely sensed data accessible to researchers. The combination of these data improves the obje...
متن کاملVisual Adjacency Multigraphs – a Novel Approach for a Web Page Classification
Standard techniques for a web page classification usually take a simple text-based approach, in which most of the information provided by the visual layout of a page is discarded. In our work we propose a new classification approach based on the visual layout analyses, conducted before implementing standard classification techniques. A page is represented as a hierarchical structure – Visual Ad...
متن کامل